05. Data Cleaning Process
The Process
Before any cleaning occurs, make a copy of each dataset. All cleaning operations are then performed on the copy, so the original dirty and/or messy dataset remains available for later reference. In pandas, DataFrames are copied with the copy method. If the original DataFrame is called df, the soon-to-be-clean copy could be named df_clean.
df_clean = df.copy()
Note that simply assigning a DataFrame to a new variable name leaves the original DataFrame vulnerable to modifications, as explained in the answers to this Stack Overflow question: "Why should I make a copy of a DataFrame in pandas?"
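The difference between assignment and copy can be seen in a minimal sketch. The toy data, the flag column, and the name df_alias are made up for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"zip_code": [92390.0, 61812.0, 7095.0]})

# Plain assignment binds a second name to the very same DataFrame object.
df_alias = df
# copy() creates a separate object with its own data (deep copy by default).
df_clean = df.copy()

# Adding a column through the alias also changes df, because both names
# refer to one object.
df_alias["flag"] = True

# Changing the copy leaves the original untouched.
df_clean["zip_code"] = df_clean["zip_code"].astype(str)

print(df_alias is df)              # same object
print("flag" in df.columns)        # the alias edit is visible in df
print("flag" in df_clean.columns)  # the copy was made before the edit
print(df["zip_code"].dtype)        # original column is still a float
```

Working on df_clean keeps df available for comparison after each cleaning step.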
An Example
Note: a copy of the original dataset was not made before cleaning in the following example, though one should have been.
Quiz
Using the snapshot of the patients table and the output of patients.info() below, answer the following matching quiz.
Snapshot of the patients table
Output of patients.info()
QUIZ QUESTION:
Match each statement below to the appropriate step of the data cleaning process for the zip code issues in the patients_clean table (a copy of the patients table).
ANSWER CHOICES:
| Data Cleaning Step | Statement |
|---|---|
| | patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0') |
| | patients_clean.zip_code.head() |
| | Convert the zip code column's data type from a float to a string using |
| | Zip code is a float not a string |
| | Zip code has four digits sometimes |
SOLUTION:
| Data Cleaning Step | Statement |
|---|---|
| | patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0') |
| | patients_clean.zip_code.head() |
| | Convert the zip code column's data type from a float to a string using |
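The zip code fix from the quiz can be tried on a small stand-in for the patients table. This is only a sketch: the three zip code values below are invented, since the real table's contents are not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for patients_clean; the values are invented.
patients_clean = pd.DataFrame({"zip_code": [92390.0, 7095.0, 61812.0]})

# Float -> string yields text such as '7095.0'. Slicing off the last two
# characters drops the '.0', and pad() left-pads to five characters with '0'.
patients_clean.zip_code = (
    patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')
)

# Inspect the result to confirm every zip code is now a five-character string.
print(patients_clean.zip_code.head())
```

This sketch assumes the column has no missing values. Depending on the pandas version, a missing zip code may be turned into the text 'nan' by astype(str) and then mangled by the slice and pad, so missing entries would need separate handling.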